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Abstract 

We present a data-driven optimal control framework that is derived using the path 
integral (PI) control approach. We find iterative control laws analytically without a 
priori policy parameterization based on probabilistic representation of the learned 
dynamics model. The proposed algorithm operates in a forward-backward manner 
which differentiate it from other Pl-related methods that perform forward sam¬ 
pling to find optimal controls. Our method uses significantly less samples to find 
analytic control laws compared to other approaches within the PI control family 
that rely on extensive sampling from given dynamics models or trials on physical 
systems in a model-free fashion. In addition, the learned controllers can be gener¬ 
alized to new tasks without re-sampling based on the compositionality theory for 
the linearly-solvable optimal control framework. We provide experimental results 
on three different tasks and comparisons with state-of-the-art model-based meth¬ 
ods to demonstrate the efficiency and generalizability of the proposed framework. 


1 Introduction 

Stochastic optimal control (SOC) is a general and powerful framework with applications in many 
areas of science and engineering. However, despite the broad applicability, solving SOC problems 
remains challenging for systems in high-dimensional continuous state action spaces. Various func¬ 
tion approximation approaches to optimal control are available Gill] but usually sensitive to model 
uncertainty. Over the last decade, SOC based on exponential transformation of the value function has 
demonstrated remarkable applicability in solving real world control and planning problems. In con¬ 
trol theory the exponential transformation of the value function was introduced in 00]. In the recent 
decade it has been explored in terms of path integral interpretations and theoretical generalizations 
00.0 .[MIj discrete time formulations 0, and scalable RL/control algorithms [ill[I^[ol[T^ . 

The resulting stochastic optimal control frameworks are known as Path Integral (PI) control for con¬ 
tinuous time, Kullback Leibler (KL) control for discrete time, or more generally Linearly Solvable 
Optimal Control 0 [I^ . 

One of the most attractive characteristics of PI control is that optimal control problems can be solved 
with forward sampling of Stochastic Differential Equations (SDEs). While the process of sampling 
with SDEs is more scalable than numerically solving partial differential equations, it still suffers 
from the curse of dimensionality when performed in a naive fashion. One way to circumvent this 
problem is to parameterize policies flolfllllldl] and then perform optimization with sampling. How¬ 
ever, in this case one has to impose the structure of the policy a-priori, therefore restrict the possible 
optimal control solutions within the assumed parameterization. In addition, the optimized policy 
parameters can not be generalized to new tasks. In general, model-free PI policy search approaches 
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require a large number of samples from trials performed on real physical systems. The issue of 
sample inefficiency further restricts the applicability of PI control methods on physical systems with 
unknown or partially known dynamics. 

Motivated by the aforementioned limitations, in this paper we introduce a sample efficient, model- 
based approach to PI control. Different from existing PI control approaches, our method combines 
the benefits of PI control theory IH 01 and probabilistic model-based reinforcement learning 
methodologies lH^flTl] . The main characteristics of the our approach are summarized as follows 

• It extends the PI control theory ||5l|^0 to the case of uncertain systems. The structural 
constraint is enforced between the control cost and uncertainty of the learned dynamics, 
which can be viewed as a generalization of previous work illEll- 

• Different from parameterized PI controllers ifl^ [Til [T^ [sl]. we find analytic control law 
without any policy parameterization. 

• Rather than keeping a fixed control cost weight |0 ssuaiii, or ignoring the con¬ 
straint between control authority and noise level iflin . in this work the control cost weight 
is adapted based on the explicit uncertainty of the learned dynamics model. 

• The algorithm operates in a different manner c omp ared to existing Pl-related methods that 
perform forward sampling ||5llarT [lOl[l^flllll2l|l4 0. More precisely our method per¬ 
form successive deterministic approximate inference and backward computation of optimal 
control law. 

• The proposed model-based approach is significantly more sample efficient than sampling- 
based PI control I5ll^0 fl8l] . In RL setting our method is comparable to the state-of-the-art 
RL methods tUfH in terms of sample and computational efficiency. 

• Thanks to the linearity of the backward Chapman-Kolmogorov PDE, the learned controllers 
can be generalized to new tasks without re-sampling by constructing composite controllers. 
In contrast, most policy search and trajectory optimization methods lfl0l[lll[l4,[TT[l4l^ 
I^l2^ find policy parameters that can not be generalized. 


2 Iterative Path Integral Control for a Class of Uncertain Systems 

2.1 Problem formulation 

We consider a nonlinear stochastic system described by the following differential equation 

dx = (f(x) -f G(x)u)df -f Bdo), (1) 

with state x £ R”, control u £ M™, and standard Brownian motion noise tu £ with variance 
Stj. f(x) is the unknown drift term (passive dynamics), G(x) £ R"><™ is the control matrix and 
B £ Rraxp jjjg diffusion matrix. Given some previous control we seek the optimal control 
correction term i5u such that the total control u = + (5u. The original system becomes 

dx = (f(x) -f G(x)(u°^‘^ -f Su))dt + Bdu; = (f(x) + G(x)u°^‘^) df + G(x)5udf + Bdtu. 

In this work we assume the dynamics based on the previous control can be represented by Gaussian 
processes (GP) such that 

fGp(x) = f (x, + Bdo;, (2) 

where fcp is the GP representation of the biased drift term f under the previous control. Now the 
original dynamical system ([1]) can be represented as follow 

dx = fcp + G(5udf, fsp(3) 
where /x^, S/ are predictive mean and covariance functions, respectively. For the GP model we use 
a prior of zero mean and covariance function K(xi, Xj) = tr^ exp(—i(xi — Xj)"'’W(xi — x^)) + 
Sija'^, with CTs, (Ttj, W the hyper-parameters. Sij is the Kronecker symbol that is one iff i = j and 
zero otherwise. Samples over fcp can be drawn using an vector of i.i.d. Gaussian variable 17 

fcp = M/ + L/17 (4) 
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where L/ is obtained using Cholesky factorization such that S/ = L/LJ. Note that generally ft is 
an infinite dimensional vector and we can use the same sample to represent uncertainty during learn¬ 
ing dm. Without loss of generality we assume to be the standard zero-mean Brownian motion. 
For the rest of the paper we use simplified notations with subscripts indicating the time step. The 
discrete-time representation of the system is xt+dt = + + and the con¬ 
ditional probability of x^+d* given X( and Juj is a Gaussian p(x(_|_dt |xt, (5ut) = Af S^+dt), 

where fJ-t+dt = + M/t + ^tSut and St+dt = '^ft- In this paper we consider a finite-horizon 

stochastic optimal control problem 


J(xo) = E 9(xt) 


C{xt,Sut)dt 


/t=o 


where the immediate cost is defined as £(X(,U() = q{xt) + ^i5u^Rt(5ut, and q{xt) = (x^ — 
x^)"'’Q(xt — x.f) is a quadratic cost function where x.f is the desired state. R* = R(xt) is a state- 
dependent positive definite weight matrix. Next we show the linearized Hamilton-Jacobi-Bellman 
equation for this class of optimal control problems. 


2.2 Linearized Hamilton-Jacobi-Bellman equation for uncertain dynamics 


At each iteration the goal is to find the optimal control update Juj that minimizes the value function 


V (xt, t) = minE 

(5ut 


£(xt, 5ut)df + V{xt + dxj, t -I- df)df|xt 


(5) 


0 is the Bellman equation. By approximating the integral for a small dt and applying Ito’s rule we 
obtain the Hamilton-Jacobi-Bellman (HJB) equation (detailed derivation is skipped); 


-dtVt = mm{qt + ^-SujRtSut 

Sut Z 


{(1ft + GtSutfy^Vt 


iTr(S/,VxxVt)). 


To find the optimal control update, we take gradient of the above expression (inside the parentheses) 
with respect to Juj and set to 0. This yields Juj = —R^^G^VxV^. Inserting this expression into 
the HJB equation yields the following nonlinear and second order PDF 


1 


1 


-dtVt = qt + {V^Ytf(ift - -(VxVO^GtR-iG^VxVt + - Tr(S/tVxxV0. (6) 

In order to solve the above PDF we use the exponential transformation of the value function 
Vt = — AlogT't, where T't = 'l'(xt) is called the desirability of Xj. The corresponding 
partial derivatives can be found as dtVt = VxVj = —■^Vx^'i and VxxVt = 

■^Vx'ktVx'k?^ — ■^Vxx'I't- Inserting these terms to (|6ll results in 

Wt y/t t t 

The quadratic terms Vx^'t will cancel out under the assumption of AGtRr^GT" = S/j. This 
constraint is different from existing works in path integral control iH 0, flOl [iSl where the 
constraint is enforced between the additive noise covariance and control authority, more precisely 
AGjR^^G^ = The new constraint enables an adaptive update of control cost weight 

based on explicit uncertainty of the learned dynamics. In contrast, most existing works use a fixed 
control cost weight j^lalA flftllSlfllllT^ ISl]. This condition also leads to more exploration (more 
aggressive control) under high uncertainty and less exploration with more certain dynamics. Given 
the aforementioned assumption, the above PDF is simplified as 


1 
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(7) 


dt'i’t = ^qt'd’t - M/tVx^'t - ^ Tr(Vxx«'tS/t) 

A Z 

subject to the terminal condition T't = exp(— The resulting Chapman-Kolmogorov PDF 0 
is linear. In general, solving 0 analytically is intractable for nonlinear systems and cost functions. 
We apply the Feynman-Kac formula which gives a probabilistic representation of the solution of the 
linear PDF 0 

r 1 

p(Tt|xt)exp ( - ^( X! 9jdf))T'Tdrt, (8) 


= lim 

dt —^0 
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where Tt is the state trajectory from time t to T. The optimal control is obtained as 

GtSut = -GtK-^Gj{V^Vt) = AG*Rr'GT(^) = S/t(^) 

+ Sut = uf ^ + G-iS/, (^). 

Rather than computing Vx'I't and T't, the optimal control Ut can be approximated based on path 
costs of sampled trajectories. Next we briefly review some of the existing approaches. 

2.3 Related works 

According to the path integral control theory llH 0,113 EM Utl^ the stochastic optimal control 
problem becomes an approximation problem of a path integral ®. This problem can be solved by 
forward sampling of the uncontrolled (u = 0) SDE ([T]). The optimal control Ut is approximated 
based on path costs of sampled trajectories. Therefore the computation of optimal controls becomes 
a forward process. More precisely, when the control and noise act in the same subspace, the optimal 
control can be evaluated as the weighted average of the noise Ut = Ep(T-j|xt) [dwt], where the 

probability of a trajectory is p(rt|xt) = j is(T*lx*))dT ’ ^'(rflxf) is defined as the path 
cost computed by performing forward sampling. However, these approaches require a large amount 
of samples from a given dynamics model, or extensive trials on physical systems when applied in 
model-free reinforcement learning settings. In order to improve sample efficiency, a nonparametric 
approach was developed by representing the desirability T't in terms of linear operators in a repro¬ 
ducing kernel Hilbert space (RKHS) IU^. As a model-free approach, it allows sample re-use but 
relies on numerical methods to estimate the gradient of desirability, i.e., Vx'I't , which can be com¬ 
putationally expensive. On the other hand, computing the analytic expressions of the path integral 
embedding is intractable and requires exact knowledge of the system dynamics. Furthermore, the 
control approximation is based on samples from the uncontrolled dynamics, which is usually not 
sufficient for highly nonlinear or underactuated systems. 

Another class of Pl-related method is based on policy parameterization. Notable approaches in¬ 
clude PP 113] , PP-CMA iTll] . PI-REPS ildl] and recently developed state-dependent PI|[3l- The 
limitations of these methods are; 1) They do not take into account model uncertainty in the passive 
dynamics f(x). 2) The imposed policy parameterizations restrict optimal control solutions. 3) The 
optimized policy parameters can not be generalized to new tasks. A brief comparison of some of 
these methods can be found in Table 1. Motivated by the challenge of combining sample efficiency 
and generalizability, next we introduce a probabilistic model-based approach to compute the optimal 
control ® analytically. 



PI 15, 6, 7], iterative PI [18] 

PI^ 1101. PI^’-CMA [111 

PI-REPS 1141 

State feedback PI181 

Our method 

Structural constraint 

AGtRjT^G) = BS„B^ 

same as PI 

same as PI 

same as PI 

AGR^^G'^' = Sf 

Dynamics model 

model-based 

model-free 

model-based 

model-based 

GP model-based 

Policy parameterization 

No 

Yes 

Yes 

Yes 

No 


Table 1; Comparison with some notable and recent path integral-related approaches. 


3 Proposed Approach 

3.1 Analytic path integral control: a forward-backward scheme 

In order to derive the proposed framework, firstly we learn the function fGp(xt) = f(x, -f 

Bdu; from sampled data. Learning the continuous mapping from state to state transition can be 
viewed as an inference with the goal of inferring the state transition dxt = fcpjxf). The kernel 
function has been defined in Sec l2.ll which can be interpreted as a similarity measure of random 
variables. More specifically, if the training input x^ and Xj are close to each other in the kernel 
space, their outputs dx^ and dx^ are highly correlated. Given a sequence of states {xq, .. .xt}, 
and the corresponding state transition {dxo,..., dx-p}, the posterior distribution can be obtained 
by conditioning the joint prior distribution on the observations. In this work we make the standard 
assumption of independent outputs (no correlation between each output dimension). 
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To propagate the GP-based dynamics over a trajectory of time horizon T we employ the moment 
matching approach to compute the predictive distribution. Given an input distribution over 

the state St), the predictive distribution over the state at t + dt can be approximated as a 

Gaussian p(xt+df) ~ Af{nt+dt^ St+dt) such that 

Mf+dt = Mt + M/f: St+dt = St + S/t + COV[xt,dxt] + COV[dxt,xt]. (10) 

The above formulation is used to approximate one-step transition probabilities over the trajectory. 
Details regarding the moment matching method can be found in [T^ . All mean and variance 
terms can be computed analytically. The hyper-parameters are learned by maximizing 

the log-likelihood of the training outputs given the inputs . Given the approximation of transition 
probability (fTOl i. we now introduce a Bayesian nonparametric formulation of path integral control 
based on probabilistic representation of the dynamics. Firstly we perform approximate inference 
(forward propagation) to obtain the Gaussian belief (predictive mean and covariance of the state) 
over the trajectory. Since the exponential transformation of the state cost exp(—ig(x)df) is an 
unnormalized Gaussian A/^(x‘^, We can evaluate the following integral analytically 

J A/'(/ij, Sj) exp ( - ig^dtjdXj = |H- ^ exp ( - - x^)'^^Q(I d- - x^)), 

( 11 ) 


forj = t+dt, Thus given a boundary condition T't = exp(—and predictive distribution 
at the final step ^t), we can evaluate the one-step backward desirability ikT-di analytically 

using the above expression (fTTT i. More generally we use the following recursive rule 

= $(Xj, Tij) = J Sj) exp ( - igjdf)T'jdXj, ( 12 ) 

for j = t + dt,T — dt. Since we use deterministic approximate inference based on (fTOl) instead 
of explicitly sampling from the corresponding SDE, we approximate the conditional distribution 
p(xj |xj_dt) by the Gaussian predictive distribution , S^ ). Therefore the path integral 


/. ^ T-dt 

= / p(Tt|xt) exp ^ gjdt))^'Tdrt 

Fr-dt, Sr-dt^ exp ^ — —qr-didt^ j exp ^ dxr dxy-di •••dxt+at 


'^T-dt 


^T-2dt 

= J St+dt) exp - igt+dtdt^^'t+dtdxt+dt = $(xt+dt, ^it+dt)- (13) 

We evaluate the desirability Tij backward in time by successive computation using the above recur¬ 
sive expression. The optimal control law Ut (|9ll requires gradients of the desirability function with 
respect to the state, which can be computed backward in time as well. For simplicity we denote the 
function ‘hjxj, T'j) by Thus we compute the gradient of the recursive expression (fTSl l 

Vx'kj-dt = 'kjVx'hj -I- $jVx'l'j, (14) 

where j = t + dt, ...,T — dt. Given the expression in (fTTT i we compute the gradient terms in (fT4l i as 


d-hj dp(xj) 


Vx*!*, = 

dp(xj) dxt 


9/i. ■ dxt 


9-l>, 


d'Sj 2 
dxt 


<&,■ / dt 


dt 


aSJ to' ^ 

d\T 


= - x,")^^Q(I + 




dt 


dt 


2 ,Q(I+^AE,Q)- 


and 


f d/x^._df ^ dfij dSj_df dSj d/ij_dt 


^ ^F,-d 


dxt 


-f 


as,- ds,- 


j-dt 


as 


j-dt 


dxt 


a/i,_ 


j-dt 


dxt 


The term Vx'I'T-dt is compute similarly. The partial derivatives 


aSj_dt dxt 

apt,. 


}■ 




aS, 

aSi-dt 


can be computed analytically as in llTII . We compute all gradients using this scheme without any 
numerical method (finite differences, etc.). Given and Vx'kt, the optimal control takes a analytic 
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form as in eq.®. Since 'I't and Vx'I't are explicit functions of Xt, the resulting control law is es¬ 
sentially different from the feedforward control in sampling-based path integral control frameworks 
iBS,0[13mi] as well as the parameterized state feedback PI control policies iffil l^. Notice that 
at current time step t, we update the control sequence Ut,...,T using the presented forward-backward 
scheme. Only Ut is applied to the system to move to the next step, while the controls are 

used for control update at future steps. The transition sample recorded at each time step is incorpo¬ 
rated to update the GP model of the dynamics. A summary of the proposed algorithm is shown in 
Algorithm [1] 


Algorithm 1 Sample efficient path integral control under uncertain dynamics 

1: Initialization: Apply random controls uo,..,t to the physical system O, record data. 

2: repeat 
3: for t=0;T do 

4: Incorporate transition sample to learn GP dynamics model. 

5: repeat 

6: Approximate inference for predictive distributions using j, = Uj t, see (fTOl i. 

7: Backward computation of optimal control updates 5ut,..,T, see (fT3T l (fl4ll (i^. 

8: Update optimal controls u.t,..,T = u* t + .. t- 

9: until Convergence. 

10: Apply optimal control Uj to the system. Move one step forward and record data. 

11: end for 

12: until Task learned. 


3.2 Generalization to unlearned tasks without sampling 


In this section we describe how to generalize the learned controllers for new (unlearned) tasks with¬ 
out any interaction with the real system. The proposed approach is based on the compositionality 
theory ll^ in linearly solvable optimal control (LSOC). We use superscripts to denote previously 
learned task indexes. Firstly we define a distance measure between the new target and old targets 
x'^^, k = 1,.., AT, i.e., a Gaussian kernel 


LO 


k 



(15) 


where P is a diagonal matrix (kernel width). The composite terminal cost q{xT) for the new task 
becomes 


g(xT) = -A log 


Ef^i^^exp(-lgfe(xr)) 


(16) 


where q’^{KT) is the terminal cost for old tasks. For conciseness we define a normalized distance 
measure uj^ = which can be interpreted as a probability weight. Based on (fThl l we have 


the composite terminal desirability for the new task which is a linear combination of 


T't = exp 






T- 


fc=l 


(17) 


Since is the solution to the linear Chapman-Kolmogorov PDF Q, the linear combination of 
desirability ( fTTI l holds everywhere from f to T as long as it holds on the boundary (terminal time 
step). Therefore we obtain the composite control 


Ut 


E 






(18) 


The composite control law in (fTSI) is essentially different from an interpolating control law ll^ . It 
enables sample-free controllers that constructed from learned controllers for different tasks. This 
scheme can not be adopted in policy search or trajectory optimization methods such as MM 
[IlilSITIMilllll. Alternatively, generalization can be achieved by imposing task-dependent 
policies ll2^ . However, this approach might restrict the choice of optimal controls given the assumed 
structure of control policy. 
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4 Experiments and Analysis 


We consider 3 simulated RL tasks: cart-pole (CP) swing up, double pendulum on a cart (DPC) 
swing up, and PUMA-560 robotic arm reaching. The CP and DPC systems consist of a cart and a 
single/double-link pendulum. The tasks are to swing-up the single/double-link pendulum from the 
initial position (point down). Both CP and DPC are under-actuated systems with only one control 
acting on the cart. PUMA-560 is a 3D robotic arm that has 12 state dimensions, 6 degrees of 
freedom with 6 actuators on the joints. The task is to steer the end-effector to the desired position 
and orientation. 

In order to demonstrate the performance, we compare the proposed control framework with three 
related methods: iterative path integral control j lSll with known dynamics model, PILCO IT^ and 
PDDP tH]. Iterative path integral control is a sampling-based stochastic control method. It is 
based on importance sampling using controlled diffusion process rather than passive dynamics used 
in standard path integral control lHlalj]- Iterative PI control is used as a baseline with a given 
dynamics model. PILCO is a model-based policy search method that features state-of-the-art data 
efficiency in terms of number of trials required to learn a task. PILCO requires an extra optimizer 
(such as BFGS) for policy improvement. PDDP is a Gaussian belief space trajectory optimization 
approach. It performs dynamic programming based on local approximation of the learned dynamics 
and value function. Both PILCO and PDDP are applied with unknown dynamics. In this work we do 
not compare our method with model-free Pl-related approaches such as ||I3[ni[ll[Il since these 
methods would certainly cost more samples than model-based methods such as PILCO and PDDP. 
The reason for choosing these two methods for comparison is that our method adopts a similar model 
learning scheme while other state-of-the-art methods, such as ll20l] is based on a different model. 

In experiment 1 we demonstrate the sample efficiency of our method using the CP and DPC tasks. 
For both tasks we choose T — 1.2 and df = 0.02 (60 time steps per rollout). The iterative PI 
lUsll with a given dynamics model uses 10^/10^ (CP/DPC) sample rollouts per iteration and 500 
iterations at each time step. We initialize PILCO and the proposed method by collecting 2/6 sample 
rollouts (corresponding to 120/360 transition samples) for CP/DPC tasks respectively. At each trial 
(on the true dynamics model), we use 1 sample rollout for PILCO and our method. PDDP uses 
4/5 rollouts (corresponding to 240/300 transition samples) for initialization as well as at each trial 
for the CP/DPC tasks. Fig. [T]shows the results in terms of rpT and computational time. For both 
tasks our method shows higher desirability (lower terminal state cost) at each trial, which indicates 
higher sample efficiency for task learning. This is mainly because our method performs online re¬ 
optimization at each time step. In contrast, the other two methods do not use this scheme. However 
we assume partial information of the dynamics (G matrix) is given. PILCO and PDDP perform 
optimization on entirely unknown dynamics. In many robotic systems G corresponds to the inverse 
of the inertia matrix, which can be identified based on data as well. In terms of computational effi¬ 
ciency, our method outperforms PILCO since we compute the optimal control update analytically, 
while PILCO solves large scale nonlinear optimization problems to obtain policy parameters. Our 
method is more computational expensive than PDDP because PDDP seeks local optimal controls 
that rely on linear approximations, while our method is a global optimal control approach. Despite 
the relatively higher computational burden than PDDP, our method offers reasonable efficiency in 
terms of the time required to reach the baseline performance. 

In experiment 2 we demonstrate the generalizability of the learned controllers to new tasks using 
the composite control law ( fTSl l based on the PUMA-560 system. We use T — 2 and d/ = 0.02 
(100 time steps per rollout). First we learn 8 independent controllers using Algorithm[T] The target 
postures are shown in Fig. |2] For all tasks we initialize with 3 sample rollouts and 1 sample at each 
trial. Blue bars in Fig. l2bl shows the desirabilities rpT after 3 trials. Next we use the composite law 
(ITSl l to construct controllers without re-sampling using 7 other controllers learned using Algorithm 

[T] For instance the composite controller for task^j^l is found as The 

performance comparison of the composite controllers with controllers learned from trials is shown in 
Fig. |2] It can be seen that the composite controllers give close performance as independently learned 
controllers. The compositionality theory ll^ generally does not apply to poli cy search methods and 
trajectory optimizers such as PILCO, PDDP, and other recent methods 11^ uH 1^ . Our method 
benefits from the compositionality of control laws that can be applied for multi-task control without 
re-sampling. 
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Double pendulum on a cart 



(a) (b) 

Figure 1; Comparison in terms of sample efficiency and computational efficiency for (a) cart-pole 
and (b) double pendulum on a cart swing-up tasks. Left subfigures show the terminal desirability 'I't 
( for PILCO and PDDP, 'I't is computed using terminal state costs) at each trial. Right subfigures 
show computational time (in minute) at each trial. 




Task# 

(a) (b) 

Figure 2: Resutls for the PUMA-560 tasks, (a) 8 tasks tested in this experiment. Each number 
indicates a corresponding target posture, (b) Comparison of the controllers learned independently 
from trials and the composite controllers without sampling. Each composite controller is obtained 
( fTSl l from 7 other independent controllers learned from trials. 

5 Conclusion and Discussion 


We presented an iterative learning control framework that can find optimal controllers under uncer¬ 
tain dynamics using very small number of samples. This approach is closely related to the family 
of path integral (PI) control algorithms. Our method is based on a forward-backward optimiza¬ 
tion scheme, which differs significantly from current Pl-related approaches. Moreover, it combines 
the attractive characteristics of probabilistic model-based reinforcement learning and linearly solv¬ 
able optimal control theory. These characteristics include sample efficiency, optimality and gen- 
eralizability. By iteratively updating the control laws based on probabilistic representation of the 
learned dynamics, our method demonstrated encouraging performance compared to the state-of- 
the-art model-based methods. In addition, our method showed promising potential in performing 
multi-task control based on the compositionality of learned controllers. Besides the assumed struc¬ 
tural constraint between control cost weight and uncertainty of the passive dynamics, the major 
limitation is that we have not taken into account the uncertainty in the control matrix G. Euture 
work will focus on further generalization of this framework and applications to real systems. 


Acknowledgments 

This research is supported by NSF NRI-1426945. 


8 






























References 

[1] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming (optimization and neural computation 
series, 3). Athena Scientific, 7:15-23, 1996. 

[2] A.G. Barto, W. Powell, J. Si, and D.C. Wunsch. Handbook of learning and approximate dynamic pro¬ 
gramming. 2004. 

[3] W.H. Fleming. Exit probabilities and optimal stochastic control. Applied Math. Optim, 9:329-346, 1971. 

[4] W. H. Fleming and H. M. Soner. Controlled Markov processes and viscosity solutions. Applications of 
mathematics. Springer, New York, 1st edition, 1993. 

[5] H. J. Kappen. Linear theory for control of nonlinear stochastic systems. Phys Rev Lett, 95:200-201, 2005. 

[6] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical 
Mechanics: Theory and Experiment, ILPllOll, 2005. 

[7] H. J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. AlP 
Conference Proceedings, 887(1), 2007. 

[8] S. Thijssen and H. J. Kappen. Path integral control and state-dependent feedback. Phys. Rev. E, 
91:032104, Mar 2015. 

[9] E. Todorov. Efficient computation of optimal actions. Proceedings of the national academy of sciences, 
106(28): 11478-11483, 2009. 

[10] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement 
learning. The Journal of Machine Learning Research, 11:3137-3181, 2010. 

[11] F. Stulp and O. Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceed¬ 
ings of the 29th International Conference on Machine Learning (ICML), pages 281-288. ACM, 2012. 

[12] K. Rawlik, M. Toussaint, and S. Vijayakumar. Path integral control by reproducing kernel hilbert space 
embedding. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 
IJCAF13, pages 1628-1634, 2013. 

[13] Y. Pan and E. Theodorou. Nonparametric infinite horizon kullback-leibler stochastic control. In 2014 
IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1-8. 
IEEE, 2014. 

[14] V. Gomez, H.J. Kappen, J. Peters, and G. Neumann. Policy search for path integral control. In Machine 
Learning and Knowledge Discovery in Databases, pages 482^97. Springer, 2014. 

[15] K. Dvijotham and E Todorov. Linearly solvable optimal control. Reinforcement learning and approximate 
dynamic programming for feedback control, pages 119-141, 2012. 

[16] M.P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and 
Trends in Robotics, 2(1-2): 1-142, 2013. 

[17] M. Deisenroth, D. Fox, and C. Rasmussen. Gaussian processes for data-efficient learning in robotics and 
control. IEEE Transsactions on Pattern Analysis and Machine Intelligence, 27:75-90, 2015. 

[18] E. Theodorou and E. Todorov. Relative entropy and free energy dualities: Connections to path integral 
and kl control. In 51st IEEE Conference on Decision and Control, pages 1466-1473, 2012. 

[19] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. \n Advances in Neural Infor¬ 
mation Processing Systems (NIPS), pages 1907-1915, 2014. 

[20] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown 
dynamics. In Advances in Neural Information Processing Systems (NIPS), pages 1071-1079, 2014. 

[21] S. Levine and V. Koltun. Learning complex neural network policies with trajectory optimization. In 
Proceedings of the 31st International Conference on Machine Learning (ICML-I4), pages 829-837, 2014. 

[22] J. Schulman, S. Levine, P. Moritz, M. 1. Jordan, and P. Abbeel. Trust region policy optimization. arXiv 
preprint arXiv:1502.05477, 2015. 

[23] P. Hennig. Optimal reinforcement learning for gaussian systems. In Advances in Neural Information 
Processing Systems (NIPS), pages 325-333, 2011. 

[24] J. Quinonero Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of uncertainty in bayesian 
kernel models-application to multiple-step ahead forecasting. In IEEE International Conference on 
Acoustics, Speech, and Signal Processing, 2003. 

[25] C.K.I Williams and C.E. Rasmussen. Gaussian processes for machine learning. MIT Press, 2006. 

[26] E. Todorov. Compositionality of optimal control laws. In Advances in Neural Information Processing 
Systems (NIPS), pages 1856-1864, 2009. 

[27] M.P. Deisenroth, P. Englert, J. Peters, and D. Fox. Multi-task policy search for robotics. In Proceedings 
of 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014. 


9 



